Applied predictive modelling book - Chap 6: Linear regression

Tidymodels workflow

lruolin
11-25-2021

Linear regression:

For linear regression, the number of predictors must not exceed the number of observations.

There must not be any missing data for the outcome variable (if there is, it should be imputed). The data should be centered and scaled, and correlated predictors should be removed, either by subset selection, by shrinkage methods (lasso or ridge regression), or by dimensionality reduction (PCA or PLS).
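As a rough sketch, these preprocessing steps map onto steps from the recipes package (loaded as part of tidymodels below). The data frame df and outcome y are placeholders, not objects from this post:

# a minimal preprocessing sketch, assuming a data frame df with numeric outcome y
preprocess_sketch <- recipe(y ~ ., data = df) %>% 
  step_impute_mean(all_numeric_predictors()) %>%        # impute missing values
  step_normalize(all_numeric_predictors()) %>%          # centre and scale
  step_corr(all_numeric_predictors(), threshold = 0.9)  # drop highly correlated predictors
# alternatively, replace step_corr() with step_pca() (or a PLS step) for dimensionality reduction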

The exercise below uses the tecator dataset from the Applied Predictive Modelling book. The workflow follows the one outlined in this youtube video, and I also took reference from Julia Silge’s blogpost on lasso regression for The Office.

Load packages

library(pacman)
p_load(tidymodels, tidyverse, janitor, caret, corrr, glmnet, ISLR, skimr)

IR Spectroscopy

"These data are recorded on a Tecator Infratec Food and Feed Analyzer working in the wavelength range 850 - 1050 nm by the Near Infrared Transmission (NIT) principle. Each sample contains finely chopped pure meat with different moisture, fat and protein contents.

“For each meat sample the data consists of a 100 channel spectrum of absorbances and the contents of moisture (water), fat and protein. The absorbance is -log10 of the transmittance measured by the spectrometer. The three contents, measured in percent, are determined by analytic chemistry.”

Load data

data(tecator)

absorbance_df <- as_tibble(absorp) %>% 
  clean_names()

# glimpse(absorbance_df) # absorbance

endpoints_df <- as_tibble(endpoints) %>% 
  rename(water = V1,
         fat = V2,
         protein = V3)

glimpse(endpoints_df)
Rows: 215
Columns: 3
$ water   <dbl> 60.50, 46.00, 71.00, 72.80, 58.30, 44.00, 44.00, 69.…
$ fat     <dbl> 22.5, 40.1, 8.4, 5.9, 25.5, 42.7, 42.7, 10.6, 19.9, …
$ protein <dbl> 16.7, 13.5, 20.5, 20.7, 15.5, 13.7, 13.7, 19.3, 17.7…
ir_df <- bind_cols(endpoints_df, absorbance_df) %>% 
  select(-water, -protein) # to predict fat content from IR spectrum

# glimpse(ir_df)

EDA

ir_df %>% 
  ggplot(aes(fat)) + 
  geom_freqpoly() +
  theme_classic()
summary(ir_df$fat) # fat (Y) is right skewed (mean > median)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.90    7.30   14.00   18.14   28.00   49.10 
# X variables are very correlated
ir_df %>% 
  select(-fat) %>% 
  corrr::correlate() %>% 
  corrr::rplot()
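The elastic net below handles these correlated predictors through its penalty, but as a quick sketch, caret::findCorrelation() could be used to count (or drop) predictors above a correlation cutoff:

# sketch: how many absorbance channels exceed a 0.99 pairwise correlation cutoff
ir_df %>% 
  select(-fat) %>% 
  cor() %>% 
  caret::findCorrelation(cutoff = 0.99) %>% 
  length()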

Split

set.seed(2021112501)
ir_split <- initial_split(ir_df)

ir_training <- training(ir_split)
ir_testing <- testing(ir_split)

# 10-fold cross validation set (vfold_cv default)
set.seed(2021112502)
ir_cv <- vfold_cv(ir_training)

Linear regression

Define recipe

lm_recipe <- 
  recipe(fat ~ ., data = ir_training) %>% 
  # step_sqrt(fat) %>% 
  # step_center(all_predictors()) %>%  # centering is recommended for spectral data to minimise noise
  step_normalize(everything())  # centre and scale the outcome and all predictors

lm_recipe 
Recipe

Inputs:

      role #variables
   outcome          1
 predictor        100

Operations:

Centering and scaling for everything()
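One thing to note: step_normalize(everything()) centres and scales the outcome as well, so the RMSE values reported later are on the standardized fat scale. A minimal alternative (not used here) that keeps fat in its original % units would normalise the predictors only:

# alternative sketch: normalise predictors only, leaving fat in % units
lm_recipe_alt <- 
  recipe(fat ~ ., data = ir_training) %>% 
  step_normalize(all_predictors())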

Define models

# Set up a linear model, tuning penalty and mixture to choose between OLS, ridge, lasso, and elastic net
lm_model <- linear_reg(penalty = tune(),
                       mixture = tune()) %>% 
            set_engine("glmnet")

lm_model
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = tune()

Computational engine: glmnet 

Tune for linear regression model

# penalty (lambda) = 0 gives the same fit as OLS;
# as the penalty grows, the coefficients shrink towards zero (approaching a null model)

# mixture (alpha) = 0 is ridge regression
# mixture (alpha) = 1 is lasso regression; values in between give the elastic net

grid <- expand_grid(
  penalty = seq(0, 100, by = 10),
  mixture = seq(0, 1, by = 0.1)
)

results <- tune_grid(
  lm_model,
  preprocessor = lm_recipe,
  grid = grid,
  resamples = ir_cv
)
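Before selecting the best combination, the resampling results can be inspected across the whole grid; a quick sketch:

# sketch: visualise RMSE and R-squared over the penalty-mixture grid
autoplot(results) +
  theme_classic()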

results %>% 
  show_best(metric = "rmse")
# A tibble: 5 × 8
  penalty mixture .metric .estimator  mean     n std_err .config      
    <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>        
1       0     0.5 rmse    standard   0.241    10  0.0145 Preprocessor…
2       0     0.4 rmse    standard   0.241    10  0.0141 Preprocessor…
3       0     0.2 rmse    standard   0.243    10  0.0139 Preprocessor…
4       0     0.7 rmse    standard   0.247    10  0.0140 Preprocessor…
5       0     0.1 rmse    standard   0.247    10  0.0132 Preprocessor…
results %>% 
  show_best(metric = "rsq")
# A tibble: 5 × 8
  penalty mixture .metric .estimator  mean     n std_err .config      
    <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>        
1       0     0.5 rsq     standard   0.950    10 0.00560 Preprocessor…
2       0     0.4 rsq     standard   0.950    10 0.00576 Preprocessor…
3       0     0.2 rsq     standard   0.949    10 0.00590 Preprocessor…
4       0     0.7 rsq     standard   0.948    10 0.00577 Preprocessor…
5       0     0.1 rsq     standard   0.948    10 0.00605 Preprocessor…
lm_best_param <- results %>% 
  select_best(metric = "rmse")

lm_best_param
# A tibble: 1 × 3
  penalty mixture .config               
    <dbl>   <dbl> <chr>                 
1       0     0.5 Preprocessor1_Model056
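Below, the selected values are typed manually into a fresh model specification. An equivalent, slightly shorter route (a sketch) is to finalise the tuned workflow directly with the selected parameters:

# sketch: plug the selected parameters straight into the tuned workflow
lm_final_workflow_alt <- 
  workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(lm_recipe) %>% 
  finalize_workflow(lm_best_param)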
lm_best_model <- 
  linear_reg(penalty = 0, mixture = 0.5) %>% 
  set_engine("glmnet")

lm_final_workflow <- 
  workflow() %>% 
  add_model(lm_best_model) %>% 
  add_recipe(lm_recipe)

lm_final_workflow
══ Workflow ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ──────────────────────────────────────────────────────
1 Recipe Step

• step_normalize()

── Model ─────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0
  mixture = 0.5

Computational engine: glmnet 
# final fit to training dataset
lm_final_workflow %>% 
  fit(ir_training) %>% 
  extract_fit_parsnip() %>% 
  vip::vip(num_features = 20) +
  theme_classic()
# these are the wavelengths of interest (channel indices from the vip plot above,
# mapped to approximate wavelengths in nm)
seq(850, 1050, 2) %>% 
  as_tibble() %>% 
  rowid_to_column() %>% 
  filter(rowid %in% c(1:3, 39:42))
# A tibble: 7 × 2
  rowid value
  <int> <dbl>
1     1   850
2     2   852
3     3   854
4    39   926
5    40   928
6    41   930
7    42   932
# predict
# note: this refits the chosen model on the full ir_df (test rows included),
# without the recipe; last_fit() below gives the properly held-out evaluation
lm_fit <- lm_best_model %>% 
  fit(fat ~ ., data = ir_df)

tidy(lm_fit)
# A tibble: 101 × 3
   term        estimate penalty
   <chr>          <dbl>   <dbl>
 1 (Intercept)   16.8         0
 2 v1            47.1         0
 3 v2            23.5         0
 4 v3             8.33        0
 5 v4             9.26        0
 6 v5             0.175       0
 7 v6             0           0
 8 v7             0           0
 9 v8             0           0
10 v9             0           0
# … with 91 more rows
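Since the elastic net shrinks many coefficients exactly to zero, a quick check (sketch) shows how many wavelengths the fitted model actually keeps:

# sketch: number of channels with non-zero coefficients at this penalty
tidy(lm_fit) %>% 
  filter(term != "(Intercept)", estimate != 0) %>% 
  nrow()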
# last fit and evaluate on test dataset
final_fit <- 
  lm_final_workflow %>% 
  last_fit(ir_split)

final_fit %>% 
  collect_metrics() # on test dataset: rmse 0.246, rsq 0.934 (on the standardized fat scale, since the recipe normalised the outcome too)
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       0.246 Preprocessor1_Model1
2 rsq     standard       0.934 Preprocessor1_Model1
# predicted fat values
predict(lm_fit, ir_testing)
# A tibble: 54 × 1
   .pred
   <dbl>
 1 19.6 
 2 10.6 
 3 24.0 
 4 15.3 
 5  4.44
 6  9.47
 7  9.13
 8  7.59
 9  9.94
10 17.4 
# … with 44 more rows
ir_aug <- 
  augment(lm_fit, ir_testing)


# glimpse(ir_aug)

# visualizing predicted fat vs actual fat content for test dataset
ir_aug %>% 
  ggplot(aes(fat, .pred)) +
  geom_point(col = "darkblue", size = 3) +
  geom_abline(lty = 2, col = "grey50", size = 2) +
  scale_x_continuous(expand = c(0,0), limits = c(0, 60)) +
  scale_y_continuous(expand = c(0,0), limits = c(0, 60)) +
  labs(title = "Predicted vs Actual Fat Content for Test Dataset",
       x = "Actual Fat %",
       y = "Predicted Fat %") +
  theme_classic() +
  coord_equal()
# accuracy is slightly off for higher fat % samples - tendency to over-estimate.
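For completeness, test-set metrics in the original fat % units can be computed from the augmented data with yardstick (a sketch; note these numbers are optimistic, because lm_fit above was refitted on the full ir_df, which includes the test rows - the last_fit() metrics are the properly held-out ones):

# sketch: test-set metrics in original fat % units (optimistic, see note above)
ir_aug %>% 
  yardstick::metrics(truth = fat, estimate = .pred)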

For PLS, I attempted it earlier here. The workflow is the same, except that the preprocessing step did not normalise everything.
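As a rough sketch of what a PLS preprocessing step could look like in recipes (step_pls() in recent recipes versions relies on the Bioconductor mixOmics package, and num_comp here is just a placeholder to set or tune), not necessarily the recipe used in that earlier post:

# sketch: a PLS feature-extraction recipe (assumes mixOmics is installed; num_comp is a placeholder)
pls_recipe <- 
  recipe(fat ~ ., data = ir_training) %>% 
  step_normalize(all_predictors()) %>% 
  step_pls(all_predictors(), outcome = "fat", num_comp = 3)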

Citation

For attribution, please cite this work as

lruolin (2021, Nov. 25). pRactice corner: Applied predictive modelling book - Chap 6: Linear regression. Retrieved from https://lruolin.github.io/myBlog/posts/20211125 - Linear regression  (Chap 6 Applied Predictive Modelling)/

BibTeX citation

@misc{lruolin2021applied,
  author = {lruolin},
  title = {pRactice corner: Applied predictive modelling book - Chap 6: Linear regression},
  url = {https://lruolin.github.io/myBlog/posts/20211125 - Linear regression  (Chap 6 Applied Predictive Modelling)/},
  year = {2021}
}